Proposal

Assessment of the state of science is a requirement for research. Assessment of intellectual property is a requirement for capitalization of knowledge for profit. This project explores tools for surveying patent data and research data in order to develop methods for making these assessments.

The project begins with an examination of data from NASA, whose open data portal includes a dataset of NASA patents. The format is a delimited text file. The data include a patent case number and title. Exploration of this dataset may include unsupervised learning to yield topics of patents.

The World Intellectual Property Organization (WIPO) publishes The WIPO Manual on Open Source Patent Analytics, which offers instruction in data interrogation that will be useful in subsequent parts of this project. The project will continue with a more detailed exploration of patent cases opened by NASA or another science-focused organization, with a view to revealing more topics from parts of the patent case data other than the title, for example the patent application.

There are at least two sources for patent data. The primary source is the U.S. Patent and Trademark Office (USPTO), whose Open Data Portal publishes many datasets in various formats. USPTO also supplies APIs for patent data retrieval. Besides these APIs, there is at least one potentially useful third-party API, PatentsView.

One resource for research publications is the Public Library of Science (PLOS), which publishes open-access, peer-reviewed articles. The inventory of articles seems limited, but it appears sufficient for an introduction to searching for research papers. For this we will use the R package rplos, a programmatic interface to the Solr-based search API provided by PLOS. Solr is an open source search platform built on Apache Lucene.

Our research topic is quantification and characterization of NASA innovation through inspection of NASA’s patents. For example, we expect to count patents which cite NASA patents (quantification) and to mine topics from NASA patents (characterization). We will report and visualize these findings.

The motivation for this project relates to methods for information retrieval and to the team’s collective interest and experience in startups and scientific industries. Efficient topic modelling is useful on its own and can also form the foundation for more advanced NLP techniques such as word sense disambiguation, natural language generation, and automatic summarization. A number of derivative analyses are possible based on NLP, including studies combining our results with industry or innovation metrics.

NASA Patents

Data preparation notes

  • Once you determine the data types of your columns, set the specifications and then use them for reading.
  • readr blog post. Illustrates the approach, although this syntax doesn’t work anymore. The approach writes the specification to a file.
  • readr - how to update col_spec object from spec(). Greg H.’s answer on Stack Overflow doesn’t require writing to a file. We use this approach.
## Parsed with column specification:
## cols(
##   Center = col_character(),
##   Status = col_character(),
##   `Case Number` = col_character(),
##   `Patent Number` = col_character(),
##   `Application SN` = col_character(),
##   Title = col_character(),
##   `Patent Expiration Date` = col_character()
## )
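
As a minimal sketch of the approach described in the notes above, the following reads a file once to obtain a guessed specification, overrides columns in memory, and reads again with the updated specification. The sample file, its values, and the date format are made up for illustration; only the column names come from the specification shown above. Parsing the expiration date as a date (rather than as character, as in the output above) is an assumption chosen only to demonstrate an override.

```r
library(readr)

# Hypothetical one-row sample; only the column names match the NASA file.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Center,Status,Case Number,Patent Number,Application SN,Title,Patent Expiration Date",
  "Example Center,Issued,XYZ-00001-1,0000001,00/000001,Example title,01/31/2030"
), sample_csv)

# Guess the specification without keeping the parsed data.
patent_spec <- spec_csv(sample_csv)

# Update the col_spec object in memory instead of writing it to a file.
patent_spec$cols[["Patent Number"]] <- col_character()
patent_spec$cols[["Patent Expiration Date"]] <- col_date(format = "%m/%d/%Y")

# Read again with the updated specification.
patents <- read_csv(sample_csv, col_types = patent_spec)
str(patents)
```

Keeping identifiers such as the patent number as character avoids losing leading zeros if they are later joined against other sources.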

Retrieve USPTO patent data

## $response
## $response$numFound
## [1] 2
## 
## $response$start
## [1] 0
## 
## $response$docs
##   applicationType      documentId applicationNumber documentType
## 1         UTILITY US20110212334A1        US13033085  application
## 2         UTILITY     US8623253B2        US13033085        grant
##        publicationDate         documentDate       productionDate
## 1 2011-09-01T00:00:00Z 2011-09-01T00:00:00Z 2011-08-17T00:00:00Z
## 2 2014-01-07T00:00:00Z 2014-01-07T00:00:00Z 2013-12-24T00:00:00Z
##        applicationDate
## 1 2011-02-23T00:00:00Z
## 2 2011-02-23T00:00:00Z
##                                                                                      applicant
## 1 Jolley, Scott T., Gibson, Tracy L., Williams, Martha K., Parrish, Clyde F., Parks, Steven L.
## 2 Jolley, Scott T., Gibson, Tracy L., Williams, Martha K., Parrish, Clyde F., Parks, Steven L.
##                                                                                       inventor
## 1 Jolley, Scott T., Gibson, Tracy L., Williams, Martha K., Parrish, Clyde F., Parks, Steven L.
## 2 Jolley, Scott T., Gibson, Tracy L., Williams, Martha K., Parrish, Clyde F., Parks, Steven L.
##                                                                                                                assignee
## 1    United States of America as Represented by the Administrator of the National aeronautics and, Space Administration
## 2 The United States of America as Represented by the Administrator of the National Aeronautics and Space Administration
##                                                     title
## 1 Low-Melt Poly(amic Acids) and Polyimides and Their Uses
## 2 Low-Melt Poly(amic Acids) and Polyimides and Their Uses
##                                                                               archiveUrl
## 1 https://bulkdata.uspto.gov/data/patent/application/redbook/fulltext/2011/ipa110901.zip
## 2       https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2014/ipg140107.zip
##      pdfPath year    _version_ patentNumber
## 1 NOTUPDATED 2011 1.622101e+18         <NA>
## 2 NOTUPDATED 2014 1.626650e+18      8623253
## [1] "Downloads complete."

This section answers the question, “how many other patents does a patent reference (‘count.citations’), and how many other patents reference it (‘count.cited_by’)?” This question is analogous to, “how many shoulders does a given patent stand on?”

todo: make sure it works across all dataframes; try incorporating Tommy’s snippet

We observe that each patent record lists the other patents it cites, as well as the other patents that cite it. We then ask whether it is possible to count each, for instance to show the count of citations over time.

This analysis will help us begin to answer the question, “how, and to what extent, do patents build on other patents?” Subsequent analysis should account for patent density across time, and should calculate a rough proxy metric for the referenceability of each patent (how much more or less is a given patent cited than others?).
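
A rough sketch of both ideas in base R, using the five rows that appear in the output further down. The choice of proxy (each patent's cited-by count divided by the mean cited-by count of the sample) and the column names filingYear and citedRatio are assumptions made for illustration, not the metric used in this project.

```r
# Five patents copied from the cleaned data frame shown in this report.
patents <- data.frame(
  appId      = c("US464750A", "US5585083A", "US5606014A", "US5617873A", "US5632841A"),
  citedBy    = c(4L, 30L, 20L, 59L, 140L),
  citations  = c(0L, 7L, 3L, 17L, 2L),
  filingDate = c("1891-12-08", "1995-03-30", "1995-08-04", "1994-08-25", "1995-04-04"),
  stringsAsFactors = FALSE
)

# Counts over time: total citations made and received per filing year.
patents$filingYear <- as.integer(format(as.Date(patents$filingDate), "%Y"))
by_year <- aggregate(cbind(citedBy, citations) ~ filingYear, data = patents, FUN = sum)
print(by_year)

# Assumed proxy: values above 1 mean "cited more than the average patent here".
patents$citedRatio <- patents$citedBy / mean(patents$citedBy)
print(patents[, c("appId", "citedBy", "citedRatio")])
```

A ratio against the overall mean ignores patent age; comparing each patent with others filed in the same year would be one way to account for patent density across time.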

Clean text and output to data frame

## Warning in mapply(FUN = f, ..., SIMPLIFY = FALSE): longer argument not a
## multiple of length of shorter
##        appId
## 1  US464750A
## 2 US5585083A
## 3 US5606014A
## 4 US5617873A
## 5 US5632841A
##                                                                                                        title
## 1                                                                Button-hole-scissors gage  - Google Patents
## 2                                                               Catalytic process for formaldehyde oxidation
## 3            Imide oligomers and co-oligomers containing pendent phenylethynyl groups and polymers therefrom
## 4 Non-invasive method and apparatus for monitoring intracranial pressure and pressure volume index in humans
## 5                                              Thin layer composite unimorph ferroelectric driver and sensor
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                abstract
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Disclosed is a process for oxidizing formaldehyde to carbon dioxide and water without the addition of energy. A mixture of formaldehyde and an oxidizing agent (e.g., ambient air containing formaldehyde) is exposed to a catalyst which includes a noble metal dispersed on a metal oxide which possesses more than one oxidation state. Especially good results are obtained when the noble metal is platinum, and the metal oxide which possesses more than one oxidation state is tin oxide. A promoter (i.e., a small amount of an oxide of a transition series metal) may be used in association with the tin oxide to provide very beneficial results.\n  
## 2 Controlled molecular weight imide oligomers and co-oligomers containing pendent phenylethynyl groups (PEPIs) and endcapped with nonreactive or phenylethynyl groups have been prepared by the cyclodehydration of the precursor amide acid oligomers or co-oligomers containing pendent phenylethynyl groups and endcapped with nonreactive or phenylethynyl groups. The amine terminated amide acid oligomers or co-oligomers are prepared from the reaction of dianhydride(s) with an excess of diamine(s) and diamine containing pendent phenylethynyl groups and subsequently endcapped with a phenylethynyl phthalic anhydride or monofunctional anhydride. The anhydride terminated amide acid oligomers and co-oligomers are prepared from the reaction of diamine(s) and diamine containing pendent phenylethynyl group(s) with an excess of dianhydride(s) and subsequently endcapped with a phenylethynyl amine or monofunctional amine. The polymerizations are carried out in polar aprotic solvents such as under nitrogen at room temperature. The amide acid oligomers or co-oligomers are subsequently cyclodehydrated to the corresponding imide oligomers. The polymers and copolymers prepared from these materials exhibit a unique and unexpected combination of properties.\n  
## 3                                                                                                                                                                                                Non-invasive measuring devices responsive to changes in a patient's intracranial pressure (ICP) can be accurately calibrated for monitoring purposes by providing known changes in ICP by non-invasive methods, such as placing the patient on a tilting bed and calculating a change in ICP from the tilt angle and the length of the patient's cerebrospinal column, or by placing a pressurized skull cap on the patient and measuring the inflation pressure. Absolute values for the patient's pressure-volume index (PVI) and the steady state ICP can then be determined by inducing two known changes in the volume of cerebrospinal fluid while recording the corresponding changes in ICP by means of the calibrated measuring device. The two pairs of data for pressure change and volume change are entered into an equation developed from an equation describing the relationship between ICP and cerebrospinal fluid volume. PVI and steady state ICP are then determined by solving the equation. Methods for inducing known changes in cerebrospinal fluid volume are described.\n  
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  A method for forming ferroelectric wafers is provided. A prestress layer is placed on the desired mold. A ferroelectric wafer is placed on top of the prestress layer. The layers are heated and then cooled, causing the ferroelectric wafer to become prestressed. The prestress layer may include reinforcing material and the ferroelectric wafer may include electrodes or electrode layers may be placed on either side of the ferroelectric layer. Wafers produced using this method have greatly improved output motion.\n  
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 A quasi four-level solid-state laser is provided. A laser crystal is disposed in a laser cavity. The laser crystal has a LuAG-based host material doped to a final concentration between about 2% and about 7% thulium (Tm) ions. For the more heavily doped final concentrations, the LuAG-based host material is a LuAG seed crystal doped with a small concentration of Tm ions. Laser diode arrays are disposed transversely to the laser crystal for energizing the Tm ions.\n  
##                                                                                                                                                                                                                                                                                                                                                                                                                          origin
## 1                                                                                                                                                                                                                                                                                                                                                                                                  (No Model.) A. I. CAMPBELL. 
## 2 The invention described herein was jointly made by employees of the United States Government, contract employees during the performance of work under a NASA contract which is subject to the provisions of Public Law 95-517 (35 USC 202) in which the contractor has elected not to retain title, and an employee of Rochester Gas and Electric Corporation during the performance of work under a Memorandum of Agreement.
## 3                                                                                                                                                                                                   This invention described herein was made by employees of the United States Government and may be manufactured and used by or for the Government or government purposes without payment of any royalties therein or thereof.
## 4                                                                                                                                                                     The invention described herein was made in the performance of work done by employees of the U.S. Government and may be manufactured and used by or for the government for governmental purposes without the payment of any royalties thereon or therefor.
## 5                                                                                                                                                                                                            The invention described herein was made by employees of the United States Government and may be used by and for the Government for governmental purposes without the payment of any royalties thereon or therefor.
##                                                                                                                                                                                                                                                                                                                                                             background
## 1                                                                                                                                                                                                                                                                                                                                   No. 464,750. Patented Dec.8,1891. 
## 2 This invention relates generally to oxidizing formaldehyde. It relates particularly to a process for oxidizing formaldehyde to carbon dioxide and water, which process includes exposing a gaseous mixture containing formaldehyde and an oxidizing agent to a catalyst of a noble metal dispersed on a metal oxide possessing more than one stable oxidation state.
## 3                                              The synthesis and characterization of PI has been extensively studied and documented. Reviews on PI are available. [J. W. Verbicky, Jr., "Polyimides" in Encyclopedia of Polymer Science and Engineering, 2nd Ed., John Wiley and Sons, New York, Vol. 12, 364 (1988); C. E. Sroog, Prog. Polym. Sci., 16, 591 (1991)].
## 4                                                                                                                                                                                                                                                                                                                                            1. Field of The invention
## 5                                                                                                                                                                                                    The present invention relates generally to ferroelectric devices, and more particularly to ferroelectric devices providing large mechanical output displacements.
##   citedBy citations filingDate
## 1       4         0 1891-12-08
## 2      30         7 1995-03-30
## 3      20         3 1995-08-04
## 4      59        17 1994-08-25
## 5     140         2 1995-04-04

EDA and data transformation for visualization

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Explore and Transform the Data

Examine the patent text files and view the text sections. We note 932 observations of 8 variables: 2 integer columns and 6 character columns.

## Observations: 932
## Variables: 8
## $ appId      <chr> "US464750A", "US5585083A", "US5606014A", "US5617873A"…
## $ title      <chr> "Button-hole-scissors gage  - Google Patents", "Catal…
## $ abstract   <chr> "Disclosed is a process for oxidizing formaldehyde to…
## $ origin     <chr> "(No Model.) A. I. CAMPBELL. ", "The invention descri…
## $ background <chr> "No. 464,750. Patented Dec.8,1891. ", "This invention…
## $ citedBy    <int> 4, 30, 20, 59, 140, 8, 47, 10, 19, 96, 17, 12, 39, 32…
## $ citations  <int> 0, 7, 3, 17, 2, 1, 9, 1, 11, 8, 21, 16, 37, 13, 3, 11…
## $ filingDate <chr> "1891-12-08", "1995-03-30", "1995-08-04", "1994-08-25…
appId | title | abstract | origin | background | citedBy | citations | filingDate
US464750A | Button-hole-scissors gage - Google Patents | Disclosed is a process for oxidizing formaldehyde to carbon dioxide and water without the addition of energy. A mixture of formaldehyde and an oxidizing agent (e.g., ambient air containing formaldehyde) is exposed to a catalyst which includes a noble metal dispersed on a metal oxide which possesses more than one oxidation state. Especially good results are obtained when the noble metal is platinum, and the metal oxide which possesses more than one oxidation state is tin oxide. A promoter (i.e., a small amount of an oxide of a transition series metal) may be used in association with the tin oxide to provide very beneficial results. | (No Model.) A. I. CAMPBELL. | No. 464,750. Patented Dec.8,1891. | 4 | 0 | 1891-12-08
US5585083A | Catalytic process for formaldehyde oxidation | Controlled molecular weight imide oligomers and co-oligomers containing pendent phenylethynyl groups (PEPIs) and endcapped with nonreactive or phenylethynyl groups have been prepared by the cyclodehydration of the precursor amide acid oligomers or co-oligomers containing pendent phenylethynyl groups and endcapped with nonreactive or phenylethynyl groups. The amine terminated amide acid oligomers or co-oligomers are prepared from the reaction of dianhydride(s) with an excess of diamine(s) and diamine containing pendent phenylethynyl groups and subsequently endcapped with a phenylethynyl phthalic anhydride or monofunctional anhydride. The anhydride terminated amide acid oligomers and co-oligomers are prepared from the reaction of diamine(s) and diamine containing pendent phenylethynyl group(s) with an excess of dianhydride(s) and subsequently endcapped with a phenylethynyl amine or monofunctional amine. The polymerizations are carried out in polar aprotic solvents such as under nitrogen at room temperature. The amide acid oligomers or co-oligomers are subsequently cyclodehydrated to the corresponding imide oligomers. The polymers and copolymers prepared from these materials exhibit a unique and unexpected combination of properties. | The invention described herein was jointly made by employees of the United States Government, contract employees during the performance of work under a NASA contract which is subject to the provisions of Public Law 95-517 (35 USC 202) in which the contractor has elected not to retain title, and an employee of Rochester Gas and Electric Corporation during the performance of work under a Memorandum of Agreement. | This invention relates generally to oxidizing formaldehyde. It relates particularly to a process for oxidizing formaldehyde to carbon dioxide and water, which process includes exposing a gaseous mixture containing formaldehyde and an oxidizing agent to a catalyst of a noble metal dispersed on a metal oxide possessing more than one stable oxidation state. | 30 | 7 | 1995-03-30

Data integrity

Check to make sure we are not operating on empty data. We are not.

## [1] TRUE

Track our documents in case we need to make extra rows for each word

##   docId     appId                                       title
## 1     1 US464750A Button-hole-scissors gage  - Google Patents
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             abstract
## 1 Disclosed is a process for oxidizing formaldehyde to carbon dioxide and water without the addition of energy. A mixture of formaldehyde and an oxidizing agent (e.g., ambient air containing formaldehyde) is exposed to a catalyst which includes a noble metal dispersed on a metal oxide which possesses more than one oxidation state. Especially good results are obtained when the noble metal is platinum, and the metal oxide which possesses more than one oxidation state is tin oxide. A promoter (i.e., a small amount of an oxide of a transition series metal) may be used in association with the tin oxide to provide very beneficial results.\n  
##                         origin                         background citedBy
## 1 (No Model.) A. I. CAMPBELL.  No. 464,750. Patented Dec.8,1891.        4
##   citations filingDate
## 1         0 1891-12-08

References

  1. https://github.com/tidyverse/dplyr/issues/2047
  2. https://tibble.tidyverse.org/reference/add_column.html
  3. https://www.r-bloggers.com/the-notin-operator/
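
The integrity check and the document-tracking step can be sketched in base R as follows. The two-row data frame reuses values shown earlier in this report; using cbind() rather than tibble's add_column() is a substitution made here to keep the sketch free of package dependencies.

```r
# Two rows copied from the patent data shown above.
patents <- data.frame(
  appId = c("US464750A", "US5585083A"),
  title = c("Button-hole-scissors gage  - Google Patents",
            "Catalytic process for formaldehyde oxidation"),
  stringsAsFactors = FALSE
)

# Data integrity: stop early if there is nothing to operate on.
stopifnot(nrow(patents) > 0)

# Add a running document id as the first column, so rows can still be traced
# to their source document after tokenizing into one row per word.
patents <- cbind(docId = seq_len(nrow(patents)), patents)
print(patents[, c("docId", "appId")])
```

An integer id is also shorter than appId, which keeps long tokenized tables easier to read.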

String-focused transformations

Remove third-party additions to text

## [1] "Button-hole-scissors gage  - Google Patents"
## [1] "Button-hole-scissors gage "
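
A minimal sketch of this step, assuming the only third-party addition is the literal suffix seen in the first title. The helper name strip_source_suffix is hypothetical.

```r
# Remove a fixed suffix appended by the source website.
# fixed = TRUE treats the suffix as plain text, not as a regular expression.
strip_source_suffix <- function(title, suffix = " - Google Patents") {
  sub(suffix, "", title, fixed = TRUE)
}

raw_title <- "Button-hole-scissors gage  - Google Patents"
clean_title <- strip_source_suffix(raw_title)
print(raw_title)
print(clean_title)
```

The trailing space that remains is left for the later whitespace normalization step.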

Titles and Abstracts

  1. Extract document ids, titles and abstracts and convert to tibble in order to use tidy text methods
  2. Transliterate common accented European characters to plain Latin characters (e.g. â becomes a)
  3. Replace upper with lower case characters
  4. Tokenize the title and abstract variables into words
  5. Remove default English stop words from those words
  6. Normalize whitespace: remove leading, trailing, and excess intermediate whitespace
  7. Remove numbers. Hold off on punctuation (contractions, dashes) and stemming until analysis, because key terms are sensitive to these changes
  8. We note some words to remove after visualizing the text
##   docId      appId                                        title
## 1     1  US464750A                    button-hole-scissors gage
## 2     2 US5585083A catalytic process for formaldehyde oxidation
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            abstract
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       disclosed is a process for oxidizing formaldehyde to carbon dioxide and water without the addition of energy. a mixture of formaldehyde and an oxidizing agent (e.g., ambient air containing formaldehyde) is exposed to a catalyst which .
## 2 controlled molecular weight imide oligomers and co-oligomers containing pendent phenylethynyl groups (pepis) and endcapped with nonreactive or phenylethynyl groups have been prepared by the cyclodehydration of the precursor amide acid oligomers or co-oligomers containing pendent phenylethynyl groups and endcapped with nonreactive or phenylethynyl groups. the amine terminated amide acid oligomers or co-oligomers are prepared from the reaction of dianhydride(s) with an excess of diamine(s) and diamine containing pendent phenylethynyl groups and subsequently endcapped with a phenylethynyl phthalic anhydride or monofunctional anhydride. the anhydride terminated amide acid oligomers and co-oligomers are prepared from the reaction of diamine(s) and diamine containing pendent phenylethynyl group(s) with an excess of dianhydride(s) and subsequently endcapped with a phenylethynyl amine or monofunctional amine. the polymerizations are carried out in polar aprotic solvents such as under nitrogen at room temperature. the amide acid oligomers or co-oligomers are subsequently cyclodehydrated to the corresponding imide oligomers. the polymers and copolymers prepared from these materials exhibit a unique and unexpected combination of properties.
##                                                                                                                                                                                                                                                                                                                                                                                                                          origin
## 1                                                                                                                                                                                                                                                                                                                                                                                                   (No Model.) A. I. CAMPBELL.
## 2 The invention described herein was jointly made by employees of the United States Government, contract employees during the performance of work under a NASA contract which is subject to the provisions of Public Law 95-517 (35 USC 202) in which the contractor has elected not to retain title, and an employee of Rochester Gas and Electric Corporation during the performance of work under a Memorandum of Agreement.
##                                                                                                                                                                                                                                                                                                                                                             background
## 1                                                                                                                                                                                                                                                                                                                                    No. 464,750. Patented Dec.8,1891.
## 2 This invention relates generally to oxidizing formaldehyde. It relates particularly to a process for oxidizing formaldehyde to carbon dioxide and water, which process includes exposing a gaseous mixture containing formaldehyde and an oxidizing agent to a catalyst of a noble metal dispersed on a metal oxide possessing more than one stable oxidation state.
##   citedBy citations filingDate
## 1       4         0 1891-12-08
## 2      30         7 1995-03-30

Data-focused transformations

Character statistics

##   docId      appId citedBy citations filingDate
## 1     1  US464750A       4         0 1891-12-08
## 2     2 US5585083A      30         7 1995-03-30
##                                          title title_nchar title_nmeans
## 1                    button-hole-scissors gage          25           NA
## 2 catalytic process for formaldehyde oxidation          44           NA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            abstract
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       disclosed is a process for oxidizing formaldehyde to carbon dioxide and water without the addition of energy. a mixture of formaldehyde and an oxidizing agent (e.g., ambient air containing formaldehyde) is exposed to a catalyst which .
## 2 controlled molecular weight imide oligomers and co-oligomers containing pendent phenylethynyl groups (pepis) and endcapped with nonreactive or phenylethynyl groups have been prepared by the cyclodehydration of the precursor amide acid oligomers or co-oligomers containing pendent phenylethynyl groups and endcapped with nonreactive or phenylethynyl groups. the amine terminated amide acid oligomers or co-oligomers are prepared from the reaction of dianhydride(s) with an excess of diamine(s) and diamine containing pendent phenylethynyl groups and subsequently endcapped with a phenylethynyl phthalic anhydride or monofunctional anhydride. the anhydride terminated amide acid oligomers and co-oligomers are prepared from the reaction of diamine(s) and diamine containing pendent phenylethynyl group(s) with an excess of dianhydride(s) and subsequently endcapped with a phenylethynyl amine or monofunctional amine. the polymerizations are carried out in polar aprotic solvents such as under nitrogen at room temperature. the amide acid oligomers or co-oligomers are subsequently cyclodehydrated to the corresponding imide oligomers. the polymers and copolymers prepared from these materials exhibit a unique and unexpected combination of properties.
##   abstract_nchar abstract_nmeans
## 1            235             235
## 2           1249            1249
##                                                                                                                                                                                                                                                                                                                                                                                                                          origin
## 1                                                                                                                                                                                                                                                                                                                                                                                                   (No Model.) A. I. CAMPBELL.
## 2 The invention described herein was jointly made by employees of the United States Government, contract employees during the performance of work under a NASA contract which is subject to the provisions of Public Law 95-517 (35 USC 202) in which the contractor has elected not to retain title, and an employee of Rochester Gas and Electric Corporation during the performance of work under a Memorandum of Agreement.
##   origin_nchar origin_nmeans
## 1           27      242.8637
## 2          413      242.8637
##                                                                                                                                                                                                                                                                                                                                                             background
## 1                                                                                                                                                                                                                                                                                                                                    No. 464,750. Patented Dec.8,1891.
## 2 This invention relates generally to oxidizing formaldehyde. It relates particularly to a process for oxidizing formaldehyde to carbon dioxide and water, which process includes exposing a gaseous mixture containing formaldehyde and an oxidizing agent to a catalyst of a noble metal dispersed on a metal oxide possessing more than one stable oxidation state.
##   background_nchar background_nmeans abs_nmeans
## 1               33          368.5944         NA
## 2              356          368.5944         NA

Dates

Store the dates in date format yyyy-mm-dd.

## Source: local data frame [1 x 18]
## Groups: <by row>
## 
## # A tibble: 1 x 18
##   docId appId citedBy citations filingDate title title_nchar title_nmeans
##   <int> <chr>   <int>     <int> <date>     <chr>       <int>        <dbl>
## 1     1 US46…       4         0 1891-12-08 butt…          25           NA
## # … with 10 more variables: abstract <chr>, abstract_nchar <int>,
## #   abstract_nmeans <int>, origin <chr>, origin_nchar <int>,
## #   origin_nmeans <dbl>, background <chr>, background_nchar <int>,
## #   background_nmeans <dbl>, abs_nmeans <dbl>
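
As an illustration of the date normalisation step, here is a minimal sketch in Python (not the R code used to produce this report). The helper name `to_iso_date` is hypothetical, and the input pattern "Dec.8,1891" is only an example format, borrowed from the background text of the first record.

```python
from datetime import datetime

def to_iso_date(text, fmt="%b.%d,%Y"):
    """Parse `text` with the strptime pattern `fmt` and return yyyy-mm-dd.

    Note: %b matches abbreviated month names in the current locale,
    so "Dec" assumes an English or C locale.
    """
    return datetime.strptime(text, fmt).date().isoformat()

print(to_iso_date("Dec.8,1891"))
print(to_iso_date("1995-03-30", "%Y-%m-%d"))
```

Storing the result as an ISO 8601 string (or a native date type) keeps the column sortable and unambiguous.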

Final Data Definitions

  • docId: a unique identifier for each patent document record in our sample (NASA)
  • appId: the unique patent number
  • citedBy: the number of other patents citing this patent, as provided by Google
  • citations: the number of other patents this patent cites
  • filingDate: the date the patent (for instance, a specific version of it) was filed, in yyyy-mm-dd format
  • title: the patent title with the phrase " - Google Patents" removed
  • title_nchar [patent_stats]: the number of characters in the title
  • title_nmeans [patent_stats]: the average number of characters over the titles of all patents
  • abstract: the text of the patent abstract
  • abstract_nchar: the number of characters in the abstract
  • abstract_nmeans: the average number of characters over the abstracts of all patents
  • origin: a statement of how the patent originated
  • origin_nchar [patent_stats]: the number of characters in the origin
  • origin_nmeans [patent_stats]: the average number of characters over the origins of all patents
  • background: a summary of the patent background
  • background_nchar [patent_stats]: the number of characters in the background
  • background_nmeans [patent_stats]: the average number of characters over the backgrounds of all patents
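
The `_nchar` and `_nmeans` definitions can be sketched as follows. This is a Python illustration under stated assumptions, not the R code behind this report: the function name `add_char_stats` is hypothetical, and missing values are skipped when averaging. In the sample output above, `title_nmeans` is NA and `abstract_nmeans` repeats `abstract_nchar`, which is what one would expect if a mean were taken over a column containing missing values, or over one row at a time; that reading is a guess.

```python
from statistics import mean

def add_char_stats(records, field):
    """Add <field>_nchar (per record) and <field>_nmeans (over all records)."""
    lengths = [len(r[field]) if r.get(field) is not None else None
               for r in records]
    known = [n for n in lengths if n is not None]
    # One corpus-wide average, ignoring missing values.
    average = mean(known) if known else None
    for record, n in zip(records, lengths):
        record[f"{field}_nchar"] = n
        record[f"{field}_nmeans"] = average
    return records

# The two titles shown in the sample output above.
patents = [
    {"docId": 1, "title": "button-hole-scissors gage"},
    {"docId": 2, "title": "catalytic process for formaldehyde oxidation"},
]
add_char_stats(patents, "title")
for p in patents:
    print(p["docId"], p["title_nchar"], p["title_nmeans"])
```

The average here covers only the two sample rows, so it will differ from a mean taken over the full dataset.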

References:

  1. https://dplyr.tidyverse.org/reference/mutate.html#grouped-tibbles
  2. https://dplyr.tidyverse.org/reference/summarise.html
  3. https://github.com/tidyverse/dplyr/issues/2838
  4. https://stackoverflow.com/questions/43897844/r-move-column-to-last-using-dplyr

Data Analysis

After reviewing the patent statistics and transforming the data, we chose the title and abstract fields for our data analysis.

Show Basic Word statistics

  1. Note the most frequent words over all patents
  2. Note the most frequent words per patent

Most frequent words in each title

## # A tibble: 5 x 3
##   docId word          n
##   <int> <chr>     <int>
## 1   203 ester         3
## 2   409 ester         3
## 3   711 flow          3
## 4     3 oligomers     2
## 5     4 pressure      2

Most frequent words in each Abstract

## # A tibble: 5 x 3
##   docId word        n
##   <int> <chr>   <int>
## 1   702 image      19
## 2   553 optical    17
## 3   787 flight     16
## 4   879 beam       16
## 5   879 sample     13
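
A minimal Python sketch of the per-document word count follows (the report itself was produced in R). The stop-word list is a tiny illustrative stand-in, and the names `tokenize` and `top_word_per_doc` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "is", "in", "with"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def top_word_per_doc(docs):
    """Map docId -> (word, count) for the most frequent word in each text."""
    result = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        if counts:
            result[doc_id] = counts.most_common(1)[0]
    return result

# Abstract text of docId 1 as it appears in the sample output.
docs = {
    1: ("disclosed is a process for oxidizing formaldehyde to carbon dioxide "
        "and water without the addition of energy. a mixture of formaldehyde "
        "and an oxidizing agent (e.g., ambient air containing formaldehyde) "
        "is exposed to a catalyst which ."),
}
print(top_word_per_doc(docs))
```

Summing the per-document counters gives the most frequent words over all patents.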

References:

  1. https://stackoverflow.com/questions/20495598/replace-accented-characters-in-r-with-non-accented-counterpart-utf-8-encoding
  2. https://www.tidytextmining.com/tidytext.html

Cast to Document Term Matrix

Title DTM

## <<DocumentTermMatrix (documents: 932, terms: 1973)>>
## Non-/sparse entries: 5003/1833833
## Sparsity           : 100%
## Maximal term length: NA
## Weighting          : term frequency (tf)

Abstract DTM

## <<DocumentTermMatrix (documents: 878, terms: 5037)>>
## Non-/sparse entries: 18996/4403490
## Sparsity           : 100%
## Maximal term length: NA
## Weighting          : term frequency (tf)
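
The "Sparsity: 100%" lines above appear to be rounded figures rather than literally empty matrices. The arithmetic can be checked with a short Python sketch that uses only the dimensions and non-sparse counts printed above; the function name `dtm_sparsity` is hypothetical.

```python
def dtm_sparsity(n_docs, n_terms, non_sparse):
    """Return (number of zero cells, fraction of zero cells) for a DTM."""
    total = n_docs * n_terms
    sparse = total - non_sparse
    return sparse, sparse / total

title_sparse, title_frac = dtm_sparsity(932, 1973, 5003)
abstract_sparse, abstract_frac = dtm_sparsity(878, 5037, 18996)

print(title_sparse, f"{title_frac:.2%}")        # 1833833 99.73%
print(abstract_sparse, f"{abstract_frac:.2%}")  # 4403490 99.57%
```

Both zero-cell counts match the printed "sparse entries", and both fractions round to 100% at whole-percent precision.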

Perform Abstract LDA and Visualize

Build the LDA

## A LDA_VEM topic model with 24 topics.

Experimenting with LDA

As another strategy, we can increase the number of topics and then extract the top term per document. Other strategies include creating bigrams or trigrams, or combining TF-IDF terms and other keywords (such as the NASA keywords available in tidytextmining) with the LDA topics or associated words.

Here we increase the number of topics to 100 (roughly one topic for every nine of the 878 abstracts).

## A LDA_VEM topic model with 100 topics.

Merge with dataset and extract to csv

Merge and get one topic per document

## # A tibble: 87,800 x 3
##    document topic     gamma
##    <chr>    <int>     <dbl>
##  1 702          1 0.0000718
##  2 553          1 0.0000539
##  3 787          1 0.0000744
##  4 879          1 0.0000642
##  5 261          1 0.000122 
##  6 278          1 0.0000521
##  7 279          1 0.0000649
##  8 432          1 0.0000940
##  9 2            1 0.0000678
## 10 7            1 0.0000642
## # … with 87,790 more rows
## # A tibble: 878 x 3
##    topic gamma docId
##    <int> <dbl> <int>
##  1    43 0.969     1
##  2    62 0.993     2
##  3    75 0.993     3
##  4    11 0.556     4
##  5    19 0.599     5
##  6    64 0.989     6
##  7    62 0.994     7
##  8    37 0.714     8
##  9    72 0.995     9
## 10    41 0.850    10
## # … with 868 more rows
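
Reducing the 87,800 document-topic rows (878 documents × 100 topics) to one row per document amounts to keeping the topic with the largest gamma for each document. Here is a minimal Python sketch of that step, not the report's R code. The function name is hypothetical and the rows are toy values: only the winning gammas echo the table above.

```python
def top_topic_per_document(gamma_rows):
    """gamma_rows: iterable of (document, topic, gamma) tuples.

    Returns document -> (topic, gamma) for the highest-gamma topic.
    Ties keep the first topic seen.
    """
    best = {}
    for document, topic, gamma in gamma_rows:
        if document not in best or gamma > best[document][1]:
            best[document] = (topic, gamma)
    return best

# Toy rows for two documents; the losing gammas are invented.
rows = [
    ("1", 1, 0.010), ("1", 43, 0.969), ("1", 62, 0.021),
    ("2", 1, 0.003), ("2", 43, 0.004), ("2", 62, 0.993),
]
print(top_topic_per_document(rows))
```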

Get one term per topic

## Warning in topic == topics_$docId: longer object length is not a multiple
## of shorter object length
## Warning: Length of logical index must be 1 or 100, not 878
## # A tibble: 12 x 2
##    topic term      
##    <int> <chr>     
##  1     1 conductive
##  2     2 material  
##  3     3 inflatable
##  4     4 signal    
##  5     5 single    
##  6     6 layer     
##  7     7 connecting
##  8     8 rotor     
##  9     9 band      
## 10    10 drill     
## 11    11 ratio     
## 12    12 data
## # A tibble: 87 x 4
##    topic gamma docId term      
##    <int> <dbl> <int> <chr>     
##  1    11 0.556     4 ratio     
##  2     6 0.994    16 layer     
##  3    11 0.989    24 ratio     
##  4     3 0.989    56 inflatable
##  5     3 0.790    67 inflatable
##  6     2 0.988    81 material  
##  7     3 0.989   100 inflatable
##  8     9 0.990   102 band      
##  9     1 0.993   109 conductive
## 10     4 0.472   121 signal    
## # … with 77 more rows
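
The warnings above suggest that the filter compared `topic` with `topics_$docId` element by element, so values were recycled rather than matched by key. This may be why only 12 of the 100 topics and 87 of the 878 documents appear in the result. One way to avoid this, sketched here in Python rather than the report's R, is to pick the top term per topic with a grouped maximum and then join documents to terms on the topic key. The function names and beta values are hypothetical; the document-to-topic pairs and the terms come from the tables above.

```python
def top_term_per_topic(beta_rows):
    """beta_rows: iterable of (topic, term, beta). Returns topic -> top term."""
    best = {}
    for topic, term, beta in beta_rows:
        if topic not in best or beta > best[topic][1]:
            best[topic] = (term, beta)
    return {topic: term for topic, (term, _) in best.items()}

def label_documents(doc_topics, topic_terms):
    """Join on the topic key: docId -> (topic, term or None if unmatched)."""
    return {doc: (topic, topic_terms.get(topic))
            for doc, topic in doc_topics.items()}

# Invented beta values for two topics.
beta_rows = [
    (6, "layer", 0.08), (6, "surface", 0.03),
    (11, "ratio", 0.05), (11, "signal", 0.02),
]
doc_topics = {4: 11, 16: 6}  # docId -> top topic, as in the table above

topic_terms = top_term_per_topic(beta_rows)
print(label_documents(doc_topics, topic_terms))
```

Because the join is keyed on topic, every document receives a term whenever its topic has one, whatever the relative lengths of the two tables.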

Check for errors

## [1] FALSE

Merge to patent document df

## # A tibble: 6 x 12
## # Groups:   docId [6]
##   docId appId title abstract origin background citedBy citations filingDate
##   <int> <chr> <chr> <chr>    <chr>  <chr>        <int>     <int> <chr>     
## 1     4 US56… non-… a for f… The i… 1. Field …      59        17 1994-08-25
## 2    16 US57… diff… an and … The i… Periodont…      27        11 1996-09-09
## 3    24 US58… quan… an auto… The i… "Tumor ce…       8         7 1993-07-27
## 4    56 US62… endo… an infl… This … The inven…      27         9 1996-04-17
## 5    67 US63… and … a porta… This … 1. Techni…      46         2 1999-02-02
## 6    81 US64… high… cartesi… The i… The prese…      17         5 2000-08-22
## # … with 3 more variables: topic <int>, gamma <dbl>, term <chr>

Extract

References:

  1. https://www.tidytextmining.com/tidytext.html
  2. https://www.kaggle.com/rtatman/nlp-in-r-topic-modelling#Unsupervised-topic-modeling-with-LDA

TF-IDF

TF-IDF measures how important a word is to a document relative to the rest of the corpus: it favors words that are frequent within the document (TF) but not very frequent across the corpus (IDF). Given the small dataset (corpus), we investigate this potentially more reliable method until we are able to expand or augment the LDA topic analysis and then test it.

Find the top one TF-IDF term per document

## # A tibble: 5 x 6
##   docId word              n     tf   idf tf_idf
##   <int> <chr>         <int>  <dbl> <dbl>  <dbl>
## 1     1 formaldehyde      3 0.158   6.08  0.961
## 2     2 oligomers        11 0.124   5.17  0.639
## 3     3 icp               7 0.0864  5.39  0.466
## 4     4 ferroelectric     1 0.333   5.39  1.80 
## 5     5 quasi             1 0.25    5.68  1.42
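
The columns above are consistent with the usual definitions tf = n / (words in the document) and idf = ln(documents / documents containing the word), with tf_idf as their product. A minimal Python sketch of that calculation follows; it is not the R code used for this report. The function name and toy corpus are hypothetical. The check at the end assumes a corpus of 878 abstracts and a document frequency of 4 for "ferroelectric"; the latter is inferred from the idf value, not stated in the report.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: docId -> list of tokens. Returns docId -> {word: tf-idf score}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in counts.items()
        }
    return scores

# Toy corpus: "oxidation" occurs in every document, so its idf is zero.
toy = {
    1: ["formaldehyde", "oxidation", "formaldehyde"],
    2: ["oligomers", "oxidation"],
}
scores = tf_idf(toy)
print(max(scores[1], key=scores[1].get))

# Consistency check against row 4 of the table (assumed document frequency 4).
idf = math.log(878 / 4)
print(round(idf, 2), round((1 / 3) * idf, 2))
```

Taking the highest-scoring word per document reproduces the "top one TF-IDF term" selection.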

Merge with dataset and extract to csv

Merge

docId | term | appId | title | abstract | origin | background | citedBy | citations | filingDate
1 | formaldehyde | US464750A | button-hole-scissors gage | disclosed is a process for oxidizing formaldehyde to carbon dioxide and water without the addition of energy. a mixture of formaldehyde and an oxidizing agent (e.g., ambient air containing formaldehyde) is exposed to a catalyst which . | (No Model.) A. I. CAMPBELL. | No. 464,750. Patented Dec.8,1891. | 4 | 0 | 1891-12-08
2 | oligomers | US5585083A | catalytic process for formaldehyde oxidation | controlled molecular weight imide oligomers and co-oligomers containing pendent phenylethynyl groups (pepis) and endcapped with nonreactive or phenylethynyl groups have been prepared by the cyclodehydration of the precursor amide acid oligomers or co-oligomers containing pendent phenylethynyl groups and endcapped with nonreactive or phenylethynyl groups. the amine terminated amide acid oligomers or co-oligomers are prepared from the reaction of dianhydride(s) with an excess of diamine(s) and diamine containing pendent phenylethynyl groups and subsequently endcapped with a phenylethynyl phthalic anhydride or monofunctional anhydride. the anhydride terminated amide acid oligomers and co-oligomers are prepared from the reaction of diamine(s) and diamine containing pendent phenylethynyl group(s) with an excess of dianhydride(s) and subsequently endcapped with a phenylethynyl amine or monofunctional amine. the polymerizations are carried out in polar aprotic solvents such as under nitrogen at room temperature. the amide acid oligomers or co-oligomers are subsequently cyclodehydrated to the corresponding imide oligomers. the polymers and copolymers prepared from these materials exhibit a unique and unexpected combination of properties. | The invention described herein was jointly made by employees of the United States Government, contract employees during the performance of work under a NASA contract which is subject to the provisions of Public Law 95-517 (35 USC 202) in which the contractor has elected not to retain title, and an employee of Rochester Gas and Electric Corporation during the performance of work under a Memorandum of Agreement. | This invention relates generally to oxidizing formaldehyde. It relates particularly to a process for oxidizing formaldehyde to carbon dioxide and water, which process includes exposing a gaseous mixture containing formaldehyde and an oxidizing agent to a catalyst of a noble metal dispersed on a metal oxide possessing more than one stable oxidation state. | 30 | 7 | 1995-03-30

Extract

Reference: 1. https://www.tidytextmining.com/tidytext.html

Appendices

Retrieve USPTO patent data

The code chunk for USPTO acquisition is to be used when documents beyond the Google archive are needed.

View PatentsView information (for instance, the abstract)

## [1] "app_country"       "app_date"          "app_number"       
## [4] "app_type"          "appcit_app_number" "appcit_category"
## $data
## #### A list with a single data frame (with list column(s) inside) on a patent level:
## 
## List of 1
##  $ patents:'data.frame': 1 obs. of  32 variables:
##   ..$ detail_desc_length                    : chr "21556"
##   ..$ patent_abstract                       : chr "A chemochromic sensor"..
##   ..$ patent_average_processing_time        : chr "1590"
##   ..$ patent_date                           : chr "2012-10-23"
##   ..$ patent_firstnamed_assignee_city       : chr "Washington"
##   ..$ patent_firstnamed_assignee_country    : chr "US"
##   ..$ patent_firstnamed_assignee_id         : chr "org_EolmLkaBf9MsnLD1f"..
##   ..$ patent_firstnamed_assignee_latitude   : chr "38.895"
##   ..$ patent_firstnamed_assignee_location_id: chr "38.895|-77.0367"
##   ..$ patent_firstnamed_assignee_longitude  : chr "-77.0367"
##   ..$ patent_firstnamed_assignee_state      : chr "DC"
##   ..$ patent_firstnamed_inventor_city       : chr "Titusville"
##   ..$ patent_firstnamed_inventor_country    : chr "US"
##   ..$ patent_firstnamed_inventor_id         : chr "7790787-4"
##   ..$ patent_firstnamed_inventor_latitude   : chr "28.6119"
##   ..$ patent_firstnamed_inventor_location_id: chr "28.6119|-80.8078"
##   ..$ patent_firstnamed_inventor_longitude  : chr "-80.8078"
##   ..$ patent_firstnamed_inventor_state      : chr "FL"
##   ..$ patent_id                             : chr "8293178"
##   ..$ patent_kind                           : chr "B2"
##   ..$ patent_num_cited_by_us_patents        : chr "0"
##   ..$ patent_num_claims                     : chr "14"
##   ..$ patent_num_combined_citations         : chr "10"
##   ..$ patent_num_foreign_citations          : chr "0"
##   ..$ patent_num_us_application_citations   : chr "5"
##   ..$ patent_num_us_patent_citations        : chr "5"
##   ..$ patent_number                         : chr "8293178"
##   ..$ patent_processing_time                : chr "1813"
##   ..$ patent_title                          : chr "Chemochromic detector"..
##   ..$ patent_type                           : chr "utility"
##   ..$ patent_year                           : chr "2012"
##   ..$ inventors                             :List of 1
## 
## $query_results
## #### Distinct entity counts across all downloadable pages of output:
## 
## total_patent_count = 1
## [1] "A chemochromic sensor for detecting a combustible gas, such as hydrogen, includes a chemochromic pigment mechanically mixed with a polymer and formed into a rigid or pliable material. In a preferred embodiment, the chemochromic detector includes aerogel material. The detector is robust and easily modifiable for a variety of applications and environmental conditions, such as atmospheres of inert gas, hydrogen gas, or mixtures of gases, or in environments that have variable temperature, including high temperatures such as above 100° C. and low temperatures such as below −196° C."

Research papers

Explore PLOS. Commentary is in the code chunk; unhide it to review.

##                            field                     description
## 10                      abstract                Abstract section
## 49                abstract_ngram                            <NA>
## 11      abstract_primary_display                Abstract section
## 21                 accepted_date                   Accepted Date
## 38                     affiliate                       Affiliate
## 50               affiliate_facet                            <NA>
## 5                alternate_title               Alternative Title
## 25                  article_type                    Article Type
## 51            article_type_facet                            <NA>
## 6                         author                          Author
## 52              author_affiliate                            <NA>
## 9     author_collab_only_display                          Author
## 7                 author_display                          Author
## 53                  author_facet                            <NA>
## 39                  author_notes                    Author Notes
## 8  author_without_collab_display                          Author
## 18                          body    Most sections of the article
## 54                    body_ngram                            <NA>
## 55                      body_rev                            <NA>
## 40            competing_interest    Competing Interest Statement
## 15                   conclusions             Conclusions section
## 45                     copyright             copyright-statement
## 42             counter_total_all           Total views, all time
## 43           counter_total_month       Total views, last 30 days
## 48 cross_published_journal_eissn                  no description
## 47   cross_published_journal_key     Cross Published Journal Key
## 46  cross_published_journal_name    Cross Published Journal Name
## 56              doc_partial_body                            <NA>
## 57         doc_partial_parent_id                            <NA>
## 58              doc_partial_type                            <NA>
## 59                      doc_type                            <NA>
## 36                        editor                          Editor
## 60              editor_affiliate                            <NA>
## 37                editor_display                          Editor
## 61                  editor_facet                            <NA>
## 28                         eissn                 electronic ISSN
## 30                  elocation_id             Electronic Location
## 2                     everything         All text in the article
## 62              everything_ngram                            <NA>
## 63          everything_noprocess                            <NA>
## 64                everything_rev                            <NA>
## 65          figure_table_caption                            <NA>
## 41          financial_disclosure  Financial Disclosure Statement
## 1                             id DOI (Digital Object Identifier)
## 12                  introduction            Introduction section
## 24                         issue                           Issue
## 22                       journal               Full Journal Name
## 32             journal_id_nlm_ta               Journal ID at NLM
## 31                journal_id_pmc               Journal ID at PMC
## 33          journal_id_publisher       Publisher of this Journal
## 13         materials_and_methods   Materials and Methods section
## 35                     pagecount           Total number of pages
## 29                         pissn                      print ISSN
## 19              publication_date                Publication Date
## 34                     publisher       Publisher of this Article
## 20                 received_date                   Received Date
## 17                     reference               Reference section
## 14        results_and_discussion  Results and discussion section
## 26                       subject                Subject Category
## 66                 subject_facet                            <NA>
## 67             subject_hierarchy                            <NA>
## 27               subject_level_1                Subject Category
## 68                      subject2                            <NA>
## 69                subject2_facet                            <NA>
## 70            subject2_hierarchy                            <NA>
## 71              subject2_level_1                            <NA>
## 16        supporting_information  Supporting Information section
## 44                     timestamp              Time of last index
## 3                          title                   Article Title
## 4                  title_display                   Article Title
## 72                   title_ngram                            <NA>
## 23                        volume                          Volume
##                                                                           note
## 10                                                                     no note
## 49                                                                        <NA>
## 11                            For display purposes only. Primary abstract only
## 21                                                 Requires start and end date
## 38                                                    Can have multiple values
## 50                                                                        <NA>
## 5                                                                      no note
## 25                                                                     no note
## 51                                                                        <NA>
## 6                                                     Can have multiple values
## 52                                                                        <NA>
## 9                        For display purposes only. Collaborative authors only
## 7                                                    For display purposes only
## 53                                                                        <NA>
## 39                                                                     no note
## 8  For display purposes only. All the authors except for collaborative authors
## 18                                              Without Abstract or References
## 54                                                                        <NA>
## 55                                                                        <NA>
## 40                                                                     no note
## 15                                                                     no note
## 45                                                       copyright information
## 42                                                                     no note
## 43                                                                     no note
## 48         PLoS-specific indexes for articles that appear in multiple journals
## 47         PLoS-specific indexes for articles that appear in multiple journals
## 46         PLoS-specific indexes for articles that appear in multiple journals
## 56                                                                        <NA>
## 57                                                                        <NA>
## 58                                                                        <NA>
## 59                                                                        <NA>
## 36                                                    Can have multiple values
## 60                                                                        <NA>
## 37                                                  For display purposes only.
## 61                                                                        <NA>
## 28                                                                     no note
## 30                                                     Used by Pub Med Central
## 2                                                    Includes Meta information
## 62                                                                        <NA>
## 63                                                                        <NA>
## 64                                                                        <NA>
## 65                                                                        <NA>
## 41                                                                     no note
## 1                                               Extended for partial documents
## 12                                                                     no note
## 24                                                                     no note
## 22                                                                     no note
## 32                                    Used by the National Library of Medicine
## 31                                                     Used by Pub Med Central
## 33                                                            Short identifier
## 13                                                                     no note
## 35                                            Not all articles have page count
## 29                                                                     no note
## 19                                                 Requires start and end date
## 34                                                                   Full name
## 20                                                 Requires start and end date
## 17                                                    Can have multiple values
## 14                                                                     no note
## 26                                                    Can have multiple values
## 66                                                                        <NA>
## 67                                                                        <NA>
## 27             Can have multiple values. Contains only the top level subjects.
## 68                                                                        <NA>
## 69                                                                        <NA>
## 70                                                                        <NA>
## 71                                                                        <NA>
## 16                                                                     no note
## 44                                                                     no note
## 3                                                                      no note
## 4                                                    For display purposes only
## 72                                                                        <NA>
## 23                                                                     no note
## [1] "PLoSONE"            "PLoSGenetics"       "PLoSPathogens"     
## [4] "PLoSNTD"            "PLoSCompBiol"       "PLoSBiology"       
## [7] "PLoSMedicine"       "PLoSClinicalTrials"
## $meta
## # A tibble: 1 x 2
##   numFound start
##      <int> <int>
## 1       30     0
## 
## $data
## # A tibble: 10 x 2
##    id                             title                                    
##    <chr>                          <chr>                                    
##  1 10.1371/journal.pcbi.1005681   Self-regulation strategy, feedback timin…
##  2 10.1371/journal.pcbi.1005681/… <NA>                                     
##  3 10.1371/journal.pcbi.1005681/… <NA>                                     
##  4 10.1371/journal.pcbi.1005681/… <NA>                                     
##  5 10.1371/journal.pcbi.1005681/… <NA>                                     
##  6 10.1371/journal.pcbi.1005681/… <NA>                                     
##  7 10.1371/journal.pcbi.1005681/… <NA>                                     
##  8 10.1371/journal.pcbi.1005681/… <NA>                                     
##  9 10.1371/journal.pone.0053040   Macroautophagy Abnormality in Essential …
## 10 10.1371/journal.pone.0053040/… <NA>
## # A tibble: 1 x 2
##   numFound start
##      <int> <int>
## 1     1967     0
## # A tibble: 20 x 4
##    id          publication_date  abstract               title              
##    <chr>       <chr>             <chr>                  <chr>              
##  1 10.1371/jo… 2013-04-22T00:00… "\nAbnormal α-synucle… Molecular Ageing o…
##  2 10.1371/jo… 2014-02-25T00:00… "\nα-Synuclein is the… Differential Expre…
##  3 10.1371/jo… 2011-01-31T00:00… "\n        Genetic an… Resistance to MPTP…
##  4 10.1371/jo… 2012-12-31T00:00… "\n        α-Synuclei… p62/SQSTM1-Depende…
##  5 10.1371/jo… 2013-04-25T00:00… "α-synuclein dysregul… Alpha-Synuclein In…
##  6 10.1371/jo… 2012-08-08T00:00… "\n        Phospholip… γ-Synuclein Intera…
##  7 10.1371/jo… 2011-07-14T00:00… "\n        Genetic, b… Assessment of α-Sy…
##  8 10.1371/jo… 2012-12-17T00:00… "\n        α-synuclei… α-Synuclein and An…
##  9 10.1371/jo… 2013-02-20T00:00… "\n        Cigarette … Human α4β2 Nicotin…
## 10 10.1371/jo… 2010-05-05T00:00… "Background: Melanoma… Parkinson's Diseas…
## 11 10.1371/jo… 2013-01-22T00:00… "\n        Amyloid fi… Temperature-Depend…
## 12 10.1371/jo… 2011-12-07T00:00… "\n        Alpha-synu… Redistribution of …
## 13 10.1371/jo… 2014-07-07T00:00… "\nSynucleinopathies,… Novel AAV-Based Ra…
## 14 10.1371/jo… 2015-04-06T00:00… "\nThere is unequivoc… Alpha-Synuclein Le…
## 15 10.1371/jo… 2013-05-07T00:00… "\nWhile most forms o… Impairment of Mito…
## 16 10.1371/jo… 2011-10-31T00:00… "\n        Recent res… Antibodies against…
## 17 10.1371/jo… 2010-08-11T00:00… "\nThe protein α-synu… α-Synuclein Suppre…
## 18 10.1371/jo… 2009-08-14T00:00… "\nIn synucleinopathi… Parkin Deficiency …
## 19 10.1371/jo… 2017-02-10T00:00… "\nα-Synuclein misfol… α-Synuclein increa…
## 20 10.1371/jo… 2012-04-27T00:00… "\n        α-Synuclei… Role of Alpha-Synu…

Explore one patent at a time

  • appId: US6226553B1
  • title: US5689004A Diamines containing pendent phenylethynyl groups
  • search: Find papers whose body contains the term “phenylethynyl”.
## # A tibble: 1 x 2
##   numFound start
##      <int> <int>
## 1       40     0
## # A tibble: 10 x 4
##    id          publication_date  abstract               title              
##    <chr>       <chr>             <chr>                  <chr>              
##  1 10.1371/jo… 2014-05-23T00:00… "\nWe studied pattern… Structural Probing…
##  2 10.1371/jo… 2014-12-30T00:00… "\nSquare wave voltam… Square Wave Voltam…
##  3 10.1371/jo… 2016-10-17T00:00… "\nPenicillin binding… Oxazin-5-Ones as a…
##  4 10.1371/jo… 2011-06-17T00:00… "\n        Highly sel… Design, Synthesis …
##  5 10.1371/jo… 2007-08-22T00:00… Optimization of a ser… Computer-Aided Lea…
##  6 10.1371/jo… 2012-10-23T00:00… "\n        Drug toxic… In Situ Mass Spect…
##  7 10.1371/jo… 2014-07-25T00:00… "\nAntagonists of met… The mGluR5 Antagon…
##  8 10.1371/jo… 2008-05-14T00:00… Hippocampal synaptic … MGluR5 Mediates th…
##  9 10.1371/jo… 2017-11-27T00:00… "\nPrion infections c… Inhibition of grou…
## 10 10.1371/jo… 2019-08-21T00:00… "\nInhibitory glycine… mGluR5/ERK signali…
  • appId: US6261844B1
  • title: US5730806A Gas-liquid supersonic cleaning and cleaning verification spray system
  • search: Find papers whose body contains the term “liquid supersonic cleaning”.
## # A tibble: 1 x 2
##   numFound start
##      <int> <int>
## 1       13     0
## # A tibble: 10 x 4
##    id          publication_date  abstract               title              
##    <chr>       <chr>             <chr>                  <chr>              
##  1 10.1371/jo… 2013-07-16T00:00… "\nWe report on the s… Nanoscale Roughnes…
##  2 10.1371/jo… 2018-06-01T00:00… "\nSugarcane bagasse … A comparative stud…
##  3 10.1371/jo… 2013-11-11T00:00… "\nThe development of… Comparison of Nume…
##  4 10.1371/jo… 2017-04-11T00:00… "\nThe considerable m… Hydrophobic pinnin…
##  5 10.1371/jo… 2014-01-21T00:00… "\nExenatide is an FD… Oral Delivery of E…
##  6 10.1371/jo… 2019-10-24T00:00… "\nArraying individua… Extracellular vesi…
##  7 10.1371/jo… 2016-10-27T00:00… "\nThe Cu-Li-Sn phase… The Cu-Li-Sn Phase…
##  8 10.1371/jo… 2017-06-08T00:00… "\nThe recent episode… Application of aco…
##  9 10.1371/jo… 2012-12-05T00:00… "Objective: Chronic r… Macrophages Facili…
## 10 10.1371/jo… 2012-02-20T00:00… "\n        There has … Quantitative Model…
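
The per-patent searches above could be reproduced along the following lines. This is a minimal sketch, not the code that produced the output: the helper names body_query and search_body are hypothetical, and the doc_type:full filter (which restricts results to whole articles rather than article sections) is an assumption about how the searches were configured. The call to rplos::searchplos uses its q, fl, fq, and limit arguments and returns a list with $meta and $data, as shown in the output above.

```r
# Build a Solr phrase query against the article body field.
# Embedded double quotes are dropped so the phrase stays well-formed.
body_query <- function(term) {
  cleaned <- gsub('"', "", term, fixed = TRUE)
  sprintf('body:"%s"', cleaned)
}

# Hypothetical wrapper: search PLOS for full articles whose body contains `term`.
# Requires the rplos package and network access; nothing runs until it is called.
search_body <- function(term, limit = 10) {
  if (!requireNamespace("rplos", quietly = TRUE)) {
    stop("Package 'rplos' is required for this sketch.")
  }
  rplos::searchplos(
    q     = body_query(term),
    fl    = c("id", "publication_date", "abstract", "title"),
    fq    = "doc_type:full",  # assumption: keep whole articles only
    limit = limit
  )
}

# Example usage (needs network access):
# res <- search_body("phenylethynyl")
# res$meta   # numFound and start
# res$data   # id, publication_date, abstract, title
```

Quoting the term makes Solr treat a multi-word term such as “liquid supersonic cleaning” as a phrase; without the quotes the words would be matched independently.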

Title terms

Search for papers whose bodies contain the top words from NASA patent titles.

## Selecting by n
## [1] "Search term: carbon"
## [1] "Count: 29619"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2013-08-26T00:00… "\nChina has been exp… Organic Carbon Stor…
## 2 10.1371/jo… 2013-01-10T00:00… "\n        Phenotypic… Coevolution Trumps …
## 3 10.1371/jo… 2012-09-14T00:00… "\n        Monitoring… Towards Regional, E…
## 4 10.1371/jo… 2015-03-16T00:00… "\nSoil type and fert… Dynamics of Maize C…
## 5 10.1371/jo… 2016-08-05T00:00… "\nThe alpine grassla… Ecosystem Carbon St…
## # … with 1 more row
## [1] "Search term: sensor"
## [1] "Count: 16289"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2018-08-29T00:00… "\nBuilding predictiv… Infinitely large, r…
## 2 10.1371/jo… 2015-10-23T00:00… "\nMagnetic biosensor… Configurational Sta…
## 3 10.1371/jo… 2018-10-09T00:00… "Background: In diabe… Effect of sensor lo…
## 4 10.1371/jo… 2015-05-07T00:00… "\nDetecting spreadin… Detecting the Influ…
## 5 10.1371/jo… 2014-03-04T00:00… "\nWe address the pro… Feature Selection f…
## # … with 1 more row
## [1] "Search term: based"
## [1] "Count: 229327"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2010-05-10T00:00… "Background: During t… Transcription-Assoc…
## 2 10.1371/jo… 2016-11-02T00:00… "\nLander-Waterman’s … Breaking Lander-Wat…
## 3 10.1371/jo… 2015-12-07T00:00… "\nThe introduction o… Introducing Compute…
## 4 10.1371/jo… 2017-10-27T00:00… "\nThe challenge of d… Hybrid self-optimiz…
## 5 10.1371/jo… 2018-01-31T00:00… "Background: The last… Historical trends i…
## # … with 1 more row
## [1] "Search term: composite"
## [1] "Count: 60646"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2017-01-20T00:00… "\nMussel-inspired ap… Thermal Conductivit…
## 2 10.1371/jo… 2018-04-13T00:00… "Objective: To study … Preparation and cha…
## 3 10.1371/jo… 2012-02-08T00:00… "\n        Most biolo… Composite Structura…
## 4 10.1371/jo… 2015-12-29T00:00… "\nAny release of ant… Composite Sampling …
## 5 10.1371/jo… 2018-09-24T00:00… "\nProteins with low-… Proteome-scale rela…
## # … with 1 more row
## [1] "Search term: sensing"
## [1] "Count: 45688"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2015-09-25T00:00… "\nWidely distributed… Heterogeneous Parti…
## 2 10.1371/jo… 2018-03-05T00:00… "\nCell size is thoug… A computational mod…
## 3 10.1371/jo… 2016-07-25T00:00… "\nThe rubber hand il… ‘Robot’ Hand Illusi…
## 4 10.1371/jo… 2012-12-20T00:00… "\n        Natural an… A Genome-Wide Inves…
## 5 10.1371/jo… 2011-02-02T00:00… "\n        The Arabid… Sense and Antisense…
## # … with 1 more row
## [1] "Search term: optical"
## [1] "Count: 32292"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2016-10-26T00:00… "Purpose: To investig… The Effect of Optic…
## 2 10.1371/jo… 2012-09-04T00:00… "Background: Research… A Novel Animal Mode…
## 3 10.1371/jo… 2015-10-01T00:00… "Purpose: To assess t… Glaucomatous-Type O…
## 4 10.1371/jo… 2007-02-07T00:00… "\n            The re… A Functional Archit…
## 5 10.1371/jo… 2009-10-13T00:00… "\nIn this Research A… Dynamic Coupling of…
## # … with 1 more row
## [1] "Search term: process"
## [1] "Count: 189114"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2018-10-23T00:00… "\nOrganizational pro… Development and val…
## 2 10.1371/jo… 2010-12-03T00:00… "\n        We present… Feller Processes: T…
## 3 10.1371/jo… 2017-04-27T00:00… "\nMetabolic disorder… Tracking disease pr…
## 4 10.1371/jo… 2019-02-20T00:00… "\nTropidolaemus wagl… Description of cran…
## 5 10.1371/jo… 2009-04-23T00:00… "Background: The trad… Biological Process …
## # … with 1 more row
## [1] "Search term: thermal"
## [1] "Count: 19021"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2017-03-13T00:00… "\nThermal properties… Measurement of ther…
## 2 10.1371/jo… 2011-06-17T00:00… "\n        Most organ… Isopods Failed to A…
## 3 10.1371/jo… 2014-12-22T00:00… "\nThermal energy tra… Thermophysical Prop…
## 4 10.1371/jo… 2016-02-03T00:00… "\nClimate change is … Ontogenetic Variati…
## 5 10.1371/jo… 2015-05-18T00:00… "\nThermal conductivi… Large Thermal Condu…
## # … with 1 more row
## [1] "Search term: laser"
## [1] "Count: 22406"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2014-12-26T00:00… "Objective: To evalua… Ranibizumab Monothe…
## 2 10.1371/jo… 2018-09-06T00:00… "\nPicosecond lasers … Effects of picoseco…
## 3 10.1371/jo… 2018-11-29T00:00… "\nThe nucleus accumb… Optogenetic self-st…
## 4 10.1371/jo… 2015-07-10T00:00… "\nThe mouse model of… Optimization of an …
## 5 10.1371/jo… 2013-12-11T00:00… "\nSafe and effective… Near-Infrared Laser…
## # … with 1 more row
## [1] "Search term: monitoring"
## [1] "Count: 89594"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2015-05-29T00:00… "\nThe objective of t… Optimal Design of R…
## 2 10.1371/jo… 2015-05-08T00:00… "\nIn this study, we … Do Parents Meet Ado…
## 3 10.1371/jo… 2013-02-28T00:00… "Objectives: Mortalit… Monitoring of Antir…
## 4 10.1371/jo… 2015-03-20T00:00… "Background: The cost… The Cost-Effectiven…
## 5 10.1371/jo… 2017-08-23T00:00… "\nPatients with Park… Verbal monitoring i…
## # … with 1 more row
## [1] "Search term: nanotube"
## [1] "Count: 532"
## [1] "Sample of documents found..."
## # A tibble: 6 x 4
##   id          publication_date  abstract               title               
##   <chr>       <chr>             <chr>                  <chr>               
## 1 10.1371/jo… 2013-10-04T00:00… "\nIn this study, Ag … Both Enhanced Bioco…
## 2 10.1371/jo… 2017-04-12T00:00… "\nNanotubes are form… New route for self-…
## 3 10.1371/jo… 2014-01-02T00:00… "\nNature routinely c… Composition Based S…
## 4 10.1371/jo… 2012-05-24T00:00… "\n        We present… Carbon Nanotube Sol…
## 5 10.1371/jo… 2006-04-28T00:00… "Here our goal is to … Designing a Nanotub…
## # … with 1 more row
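
The title-term search above might be sketched as follows. The function names top_title_terms and report_term, the stop-word list, and the doc_type:full filter are illustrative assumptions rather than the report's actual code, and the tokenization uses base R where the original may well have used other packages. The printed labels mirror the output shown above.

```r
# Return the n most frequent words across a vector of patent titles.
# The default stop-word list is a small illustrative assumption.
top_title_terms <- function(titles, n = 10,
                            stop_words = c("a", "an", "and", "for", "in",
                                           "of", "the", "to", "with", "using")) {
  words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
  words <- words[nchar(words) > 2 & !(words %in% stop_words)]
  counts <- sort(table(words), decreasing = TRUE)
  names(head(counts, n))
}

# Hypothetical reporter: search PLOS article bodies for one term and print
# the hit count plus a sample of matching documents.
# Requires the rplos package and network access.
report_term <- function(term, sample_size = 6) {
  res <- rplos::searchplos(
    q     = sprintf('body:"%s"', term),
    fl    = c("id", "publication_date", "abstract", "title"),
    fq    = "doc_type:full",  # assumption: keep whole articles only
    limit = sample_size
  )
  print(paste("Search term:", term))
  print(paste("Count:", res$meta$numFound))
  print("Sample of documents found...")
  print(res$data)
  invisible(res)
}

# Example usage (needs network access; `patent_titles` is a hypothetical
# character vector of NASA patent titles):
# for (term in top_title_terms(patent_titles, n = 11)) report_term(term)
```

In the output above, generic words such as “based” and “process” match far more articles than technical words such as “nanotube”, so extending the stop-word list may give more discriminating search terms.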

Resources

Collaboration

References

Agencies and services

  • WIPO. The World Intellectual Property Organization.
  • Public Library of Science. PLOS. 200K+ open access peer-reviewed journal articles.
  • CORE. Claims to be the largest aggregator of open access research papers, with an inventory of 135M+. Provides APIs for text mining.
  • rOpenSci. Non-profit that addresses reproducibility of scientific data retrieval. Maintains a raft of R packages for working with scientific data sources and supporting research. Leadership includes Jenny Bryan, and advisors include Hadley Wickham. Publishes a blog, tutorials, and videos.
  • Allen Institute for Artificial Intelligence. Semantic Scholar, etc.
  • One World Analytics. Analyzes social and environmental issues using text mining, statistics, geographic mapping and network visualisation.

US patent data

  • USPTO Open Data Portal. United States Patent and Trademark Office data products. Bulk downloads, APIs, analytics, etc.
  • PatentsView. Secondary source, with dataset downloads and APIs.

R packages

  • rplos Vignette. Interface to the Solr-based search API for PLOS journals. Its functions search for articles, retrieve articles, make plots, run faceted searches, highlight search terms, and present highlighted results in a browser.
  • fulltext Vignette. Facilitates text mining with an emphasis on open access journals. The vignette provides examples.

APIs

  • PLOS API. Documentation, articles, frequently asked questions.

Examples